# A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities

Teng Wang<sup>1</sup>, Chao Wang<sup>2</sup>, Xuehai Zhou<sup>2</sup>, Huaping Chen<sup>1</sup>

<sup>1</sup> School of Software Engineering of USTC

<sup>2</sup> School of Computer Science and Technology of USTC

Suzhou, China

sa517368@email.ustc.edu.cn, cswang@ustc.edu.cn, xhzhou@ustc.edu.cn, hpchen@ustc.edu.cn

Abstract—With the rapid development of deep learning, neural networks and deep learning algorithms have been widely used in various fields, e.g., image, video and voice processing. However, neural network models are growing ever larger, which is reflected in the number of model parameters and the amount of computation they require. Although researchers have made a wealth of efforts to improve computing performance on GPU platforms, dedicated hardware solutions remain essential and are emerging to provide advantages over pure software solutions. In this paper, we systematically investigate FPGA-based neural network accelerators. Specifically, we review accelerators designed for specific problems, specific algorithms, algorithm features, and general templates. We also compare the design and implementation of FPGA-based accelerators across different devices and network models, and compare them with CPU and GPU versions. Finally, we discuss the advantages and disadvantages of accelerators on FPGA platforms and explore opportunities for future research.

Index Terms—FPGA, Accelerator, Deep Learning, Neural Network

#### I. INTRODUCTION

In recent years, neural networks (NN) have achieved dramatic improvements over traditional algorithms in various fields. Various network models, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), have been proposed for image, video and speech processing. Well-trained CNN models have increased the top-5 image classification accuracy on the ImageNet dataset from 73.8% to 84.7%, and have further improved object detection through their excellent feature extraction capabilities. RNNs achieve state-of-the-art word error rates in speech recognition. In general, the strong ability of neural networks to fit a wide range of pattern recognition problems makes them a promising candidate for many artificial intelligence applications.

However, neural network models still suffer from high computational and storage complexity. Meanwhile, research on neural networks still focuses on increasing the scale of the models. For example, the largest CNN model for 224x224 image classification requires up to 39 billion floating point operations (FLOP) and over 500 MB of model parameters [1]. Since the computational complexity is directly proportional to the size of the input image, processing images with higher resolutions may require more than 100 billion operations.

Therefore, it is particularly important to choose a suitable computing platform for neural network applications. In general, CPUs can perform 10-100 GFLOP per second, but their power efficiency is usually below 1 GOP/J. Consequently, CPUs struggle to meet both the high-performance requirements of cloud applications and the low-power requirements of mobile applications. In contrast, GPUs offer a peak performance of up to 10 TOP/s, which makes them an excellent choice for high-performance neural network applications. Moreover, programming frameworks such as Caffe and TensorFlow provide easy-to-use interfaces on GPU platforms, making the GPU the first choice for neural network acceleration.

In addition to CPUs and GPUs, the FPGA is gradually becoming a candidate platform for energy-efficient neural network processing. With hardware designed for a specific model, FPGAs can achieve high parallelism and simplify the logic according to the computation process of the neural network. Some studies show that neural network models can be simplified in a hardware-friendly manner without affecting model accuracy. Therefore, FPGAs can achieve higher energy efficiency than CPUs and GPUs.

Back in the 1990s, the FPGA was not developed for neural networks but for rapid prototyping of electronic hardware. Since the birth of the neural network, people have explored its improvement and application without being able to settle on a development direction. Although D.S. Reay first used an FPGA to implement a neural network accelerator in 1994, this line of work received little attention because neural networks themselves were still immature. The arrival of AlexNet at ILSVRC 2012 clarified the direction of neural network development, and the research community has since moved toward deeper and more complex networks. Models such as VGGNet, GoogLeNet and ResNet followed, firmly establishing the trend toward complex neural networks. At that point, researchers began to pay attention to FPGA-based neural network accelerators, as shown in Figure 1. By last year, the number of papers on FPGA-based neural network accelerators published in IEEE Xplore had reached 69 and was still rising, which illustrates the research trend in this direction.



Fig. 1: Development history of the neural network accelerator based on FPGA.

#### II. BACKGROUND

Deep learning combines low-level features to form more abstract high-level representations (attribute categories or features) in order to discover distributed feature representations of data. The concept was proposed by Hinton et al. in 2006 [17]. Based on the Deep Belief Network (DBN), an unsupervised greedy layer-by-layer training algorithm was proposed, bringing hope for solving the optimization problems associated with deep structures. The deep structure of the multi-layer autoencoder was proposed subsequently. In addition, the convolutional neural network proposed by LeCun et al. is the first true multi-layer structure learning algorithm [29]; it exploits relative spatial relations to reduce the number of parameters and thereby improve training performance. In deep learning, a neural network is a biologically inspired model that typically includes multiple layers of neurons, and different algorithms are different combinations of network layers. Each layer receives the neurons of the previous layer as input. In the basic neural network layer, each neuron computes a weighted sum of all input neurons connected to it. The parameters of these connections are called weights. Because different neural network models are composed of different types of layers, these layers are introduced one by one below.

1) Fully connected layer. The fully connected (FC) layer connects every input neuron to every output neuron. It is parameterized by a weight matrix and a bias, which together determine each output element from the input elements, as in Equation 1.

$$F_{out} = F_{in} * Weight + bias \tag{1}$$

2) Convolutional layer. The convolutional (CONV) layer processes two-dimensional arrays of neurons. The input and output neurons of this layer can be described as sets of two-dimensional feature maps, $F_{in}$ and $F_{out}$, where each feature map is called a channel. The CONV layer applies a two-dimensional convolution kernel $K_{ij}$ to each pair of input channel $i$ and output channel $j$, and adds a bias $b_j$ to each output channel. The computation of a CONV layer with N input channels and M output channels is described by Equation 2.

$$F_{out}(j) = \sum_{i=0}^{N-1} conv2d(F_{in}(i), K_{ij}) + b_j, \quad j = 0, 1, ..., M-1 \tag{2}$$

- 3) Non-linear layer. A nonlinear function, also called the activation function, is generally applied to the output of each neuron so that the output is selectively activated. Commonly used functions are sigmoid, tanh and ReLU.
- 4) Pooling layer. The pooling layer also processes two-dimensional arrays of neurons. It downsamples each input channel separately, which helps to reduce the feature dimension. There are two common downsampling methods: average pooling and maximum pooling.
- 5) Element-wise layer. The element-wise layer is usually used in RNNs and sometimes in CNN models. This layer receives two neuron tensors of the same dimensions and performs a binary operation on the corresponding neurons of the two tensors.
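The layer types listed above can be illustrated with a minimal pure-Python sketch. The function names, the nested-list data layout, the stride of 1 without padding, and the 2x2 pooling window are assumptions made for illustration; `conv2d` is written as the sliding-window correlation commonly used in CNNs.

```python
def fully_connected(f_in, weight, bias):
    """Eq. (1): F_out = F_in * Weight + bias; weight[i][j] links input i to output j."""
    return [sum(f_in[i] * weight[i][j] for i in range(len(f_in))) + bias[j]
            for j in range(len(bias))]

def conv2d(fmap, kernel):
    """Slide one K x K kernel over one feature map (stride 1, no padding)."""
    k = len(kernel)
    rows, cols = len(fmap) - k + 1, len(fmap[0]) - k + 1
    return [[sum(fmap[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]

def conv_layer(f_in, kernels, bias):
    """Eq. (2): F_out(j) = sum_i conv2d(F_in(i), K_ij) + b_j for N inputs, M outputs."""
    out = []
    for j in range(len(bias)):
        partial = [conv2d(f_in[i], kernels[i][j]) for i in range(len(f_in))]
        rows, cols = len(partial[0]), len(partial[0][0])
        out.append([[sum(p[r][c] for p in partial) + bias[j]
                     for c in range(cols)] for r in range(rows)])
    return out

def relu(fmap):
    """Non-linear layer: element-wise max(0, x)."""
    return [[max(0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Pooling layer: non-overlapping maximum pooling on a single channel."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

def element_wise(a, b, op=lambda x, y: x + y):
    """Element-wise layer: binary operation on two equally shaped maps."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
```

The CONV layer dominates the operation count because of its nested loops over channels, rows, columns and kernel positions, which is why most of the accelerators surveyed below concentrate on it.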

#### III. STATE-OF-THE-ART DEVELOPMENTS

#### A. Acceleration methods

At present, acceleration methods for neural networks fall mainly into two types: software design optimization and hardware design improvement.

The primary goal of software design optimization is to reduce the computation or bandwidth requirements of the neural network model while preserving accuracy. There are roughly three approaches: optimization of the algorithm procedure [2], data quantization, and weight reduction [3]. Optimization of the algorithm procedure exploits the characteristics of a particular neural network model to simplify or transform the computation without affecting the result, thereby reducing both computation and bandwidth requirements. Data quantization quantizes the weights and neurons to reduce the bandwidth and storage requirements of neural network computing. Weight reduction approximates the weight matrix with a low-rank matrix so that the number of actual weights, and with it the total computation of the model, is reduced.
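As a rough illustration of weight reduction, the sketch below assumes that an m x n weight matrix W is approximated by a product U V with U of size m x r and V of size r x n. The function names and the example sizes are hypothetical, and how U and V are obtained (for example by a truncated decomposition) is outside the sketch.

```python
def low_rank_savings(m, n, r):
    """Return (dense_params, factored_params) for W (m x n) ~ U (m x r) @ V (r x n)."""
    return m * n, r * (m + n)

def matvec(mat, vec):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]

def low_rank_matvec(u, v, x):
    """Compute U @ (V @ x), which needs r*(m+n) multiply-accumulates instead of m*n."""
    return matvec(u, matvec(v, x))

# Hypothetical 4096 x 4096 layer approximated at rank 256.
dense, factored = low_rank_savings(4096, 4096, 256)
reduction = dense / factored
```

For these assumed sizes the parameter count, and with it the weight traffic, drops by a factor of eight; whether accuracy survives depends on the model and is not addressed by the sketch.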

Hardware design improvement mainly targets the characteristics of neural network algorithms, modifying the existing logic unit structure so that it can execute deep learning algorithms efficiently and quickly [5].

#### B. Acceleration platforms

At present, the hardware platforms that can accelerate neural networks mainly include GPUs, FPGAs, and ASICs. This article focuses on the FPGA platform.

Compared with ASIC platforms under the same design, FPGAs lack the advantages of faster operation, lower power consumption, and cheaper mass production, but their programmable logic arrays give them shorter design cycles. For the GPU, the CUDA general-purpose parallel computing framework also makes the design of a solution convenient and fast, but the power consumption of the GPU is higher. Overall, for the same parallel implementation, FPGA platforms can achieve good performance and efficiency improvements in a shorter time.

#### C. The focus of the neural network algorithm acceleration

According to previous surveys, most research relies on the roofline model [28] for the theoretical analysis of FPGA accelerators. In this model, the X-axis is the system's computation-to-communication (CTC) ratio, and the Y-axis is the attainable computing performance of the system. The model thus describes the relationship between the computing power of the system and its communication bandwidth.

The CTC ratio is the number of operations per unit of memory access. Each hardware design can be seen as a point in the diagram, so y/x equals the bandwidth requirement of that design. On a given platform, two constraints limit performance. First, bandwidth is limited: the achievable bandwidth is lower than the theoretical upper limit and, because of DDR access behavior, depends on the data access pattern. Second, hardware computing power is limited by the resources available on the FPGA, which are generally fixed. Improving the CTC ratio makes it more likely that the hardware reaches the computational limit rather than the bandwidth limit, and this is the central challenge in designing FPGA neural network accelerators.
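The two constraints described above combine into the usual roofline bound, sketched below. The function names, the units (GOP/s, GB/s, operations per byte) and the platform figures in the example are assumptions chosen for illustration, not measurements from any surveyed paper.

```python
def attainable_performance(compute_roof_gops, bandwidth_gbs, ctc_ratio):
    """Roofline bound: min(computational roof, CTC ratio x memory bandwidth).

    compute_roof_gops: peak compute of the design in GOP/s
    bandwidth_gbs:     sustained external bandwidth in GB/s
    ctc_ratio:         operations per byte of external memory traffic
    """
    return min(compute_roof_gops, ctc_ratio * bandwidth_gbs)

def is_bandwidth_bound(compute_roof_gops, bandwidth_gbs, ctc_ratio):
    """True when the design sits on the sloped (memory-limited) part of the roofline."""
    return ctc_ratio * bandwidth_gbs < compute_roof_gops

# Hypothetical platform: 100 GOP/s compute roof and 4 GB/s sustained bandwidth.
low_reuse = attainable_performance(100.0, 4.0, 10.0)   # limited by bandwidth
high_reuse = attainable_performance(100.0, 4.0, 50.0)  # limited by compute
```

In this example, raising the CTC ratio from 10 to 50 operations per byte moves the design from the bandwidth-limited region to the computational roof without adding any compute resources.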

Previous research has mainly increased the CTC ratio in the following ways in order to optimize the neural network accelerator.

1) Common characteristics of the algorithm: Within a neural network algorithm, different parts contribute differently to the overall execution time. Nonetheless, neural network algorithms share many common features, and a more general accelerator can be designed around them. These common features are matrix operations, nonlinear activation functions and a huge number of internal parameters.

For the matrix computation in neural networks, techniques such as im2col, Winograd-based methods [5], loop unrolling and matrix sparsity analysis can increase the data reuse of each computation. They thereby reduce the total number of memory accesses and increase the CTC ratio [2, 6], which is defined in Equation 3.

$$CTC \ Ratio = \frac{total \ operation \ times}{total \ external \ data \ access \ times} = \frac{2*R*C*M*N*K^2}{\alpha_{in}*B_{in} + \alpha_{\omega}*B_{\omega} + \alpha_{out}*B_{out}} \tag{3}$$
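Equation 3 can be evaluated for a tiled convolution layer as sketched below. The text does not define the trip counts $\alpha$ or the buffer sizes $B$, so the expressions used here are assumptions: each buffer holds one tile, the input and weight tiles are reloaded for every innermost iteration, and the output tile is transferred once per output tile. The tile sizes `Tr`, `Tc`, `Tm`, `Tn`, the stride `S` and the function name are hypothetical, and traffic is counted in data words.

```python
from math import ceil

def ctc_ratio(R, C, M, N, K, S, Tr, Tc, Tm, Tn):
    """Operations per externally transferred data word for one tiled CONV layer."""
    ops = 2 * R * C * M * N * K * K                 # numerator of Eq. (3)

    # Assumed trip counts: how often each buffer is filled or drained.
    alpha_out = ceil(M / Tm) * ceil(R / Tr) * ceil(C / Tc)
    alpha_in = alpha_out * ceil(N / Tn)
    alpha_w = alpha_in

    # Assumed buffer sizes in data words for one tile.
    b_in = Tn * (S * Tr + K - S) * (S * Tc + K - S)
    b_w = Tm * Tn * K * K
    b_out = Tm * Tr * Tc

    traffic = alpha_in * b_in + alpha_w * b_w + alpha_out * b_out
    return ops / traffic

# Hypothetical 4x4 output, 2 input and 2 output channels, 3x3 kernel, stride 1.
whole_layer = ctc_ratio(4, 4, 2, 2, 3, 1, Tr=4, Tc=4, Tm=2, Tn=2)
small_tiles = ctc_ratio(4, 4, 2, 2, 3, 1, Tr=2, Tc=2, Tm=1, Tn=1)
```

Under these assumptions smaller tiles reload overlapping input halos and weights more often, so `small_tiles` is lower than `whole_layer`; the tile sizes therefore trade on-chip buffer capacity against the CTC ratio.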

Little research currently addresses the optimization of nonlinear activation functions. The only direction explored is linear fitting of the nonlinear activation function, and its results are not ideal.

In contrast, many studies address the reduction of internal parameters. Reducing the internal parameters, that is, the storage size of the weight matrix, lowers the bandwidth required of the hardware platform, which is equivalent to increasing the CTC ratio.

In Figure 2, the abscissa lists each network together with its configuration, given as (number of bits of the weight matrix) x (number of bits of the neuron matrix); FT indicates that the network was fine-tuned after quantization. The ordinate is the accuracy loss of each model relative to the original model. It can be seen that, for linear quantization, the data width should preferably not be reduced below eight bits if accuracy is to be preserved after the transformation.
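The effect of bit width under linear quantization can be illustrated with a small sketch. It assumes symmetric uniform quantization with a single scale per tensor, which is only one of several possible schemes; the function names and the example weights are hypothetical.

```python
def quantize_linear(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit codes; returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]

def max_abs_error(values, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_linear(values, bits)
    return max(abs(v - w) for v, w in zip(values, dequantize(codes, scale)))

weights = [0.5, -1.0, 0.25, 0.75]       # hypothetical weights
error_8bit = max_abs_error(weights, 8)
error_4bit = max_abs_error(weights, 4)
```

The rounding error is bounded by half a quantization step, and the step roughly doubles with every bit removed, which is consistent with the observation that accuracy degrades quickly below eight bits.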

2) Parallel Neural Network Algorithm: Computational parallelization is the most commonly used acceleration method for neural network algorithms. Task-level parallelism, data-level parallelism, and hardware-level parallelism are the main forms of parallel processing used in accelerator optimization. In essence, parallelizing a neural network algorithm means parallelizing the computation at the core of the algorithm in order to achieve better acceleration.

Fig. 2: Comparison of different data quantization methods [3, 7, 8, 9, 10, 11, 23].

Task-level parallelism involves the optimization of software systems. Because the studies use a variety of open frameworks, no relevant resources have been found so far.

For data-level parallelism, many studies use a double-buffer mechanism, which shortens the overall computation time by hiding the cost of data transmission behind the computation in the computational unit [2].
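The benefit of the double-buffer mechanism can be estimated with a simple timing model. The sketch assumes a fixed load time and compute time per tile and exactly two buffers; the function names and the cycle counts are hypothetical.

```python
def single_buffer_time(load, compute, tiles):
    """One buffer: every tile is loaded, then computed, with no overlap."""
    return tiles * (load + compute)

def double_buffer_time(load, compute, tiles):
    """Two ping-pong buffers: tile i+1 is loaded while tile i is being computed."""
    load_done = []      # time at which tile i is fully loaded
    compute_done = []   # time at which tile i is fully computed
    for i in range(tiles):
        prev_load = load_done[i - 1] if i >= 1 else 0
        # The buffer for tile i is free once tile i-2 has been computed.
        buffer_free = compute_done[i - 2] if i >= 2 else 0
        load_done.append(max(prev_load, buffer_free) + load)
        prev_compute = compute_done[i - 1] if i >= 1 else 0
        compute_done.append(max(load_done[i], prev_compute) + compute)
    return compute_done[-1] if tiles else 0

# Hypothetical tile costs in cycles: 3 to load, 5 to compute, 4 tiles.
serial = single_buffer_time(3, 5, 4)
overlapped = double_buffer_time(3, 5, 4)
```

With overlap, only the first load and the last compute remain exposed, and the steady-state cost per tile is the larger of the load and compute times.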

Hardware-level parallelism has been approached in many ways, such as reconfiguring the calculation units for each layer, choosing one compromise parameter scheme for all layers [5], or configuring a hardware platform with different parameter combinations in advance [12]. Another solution accelerates computation by hot-switching between differently configured platforms during the computation. In addition, pipelining is often used in accelerators to increase throughput [12].

#### IV. DESIGN OF FPGA-BASED DEEP LEARNING ACCELERATORS

This section summarizes previous research on FPGA-based neural network accelerators. In general, the problems these papers start from and the accelerator schemes they design differ widely. Depending on the type of problem to be solved, however, current research can be classified into four categories: accelerators for specific applications, accelerators for specific algorithms, accelerators for common features of algorithms, and general accelerator frameworks with hardware templates. These four categories progress from customized to general, with increasing design difficulty. Accelerators of the first two types are currently more common and relatively easy to design. For the latter two categories, and especially the last, the design difficulty is relatively large; they are still at the research stage and have not yet been widely adopted.

#### A. Designing Accelerators for Specific Applications

Designing FPGA accelerators for specific problems is currently the most widespread application of FPGA accelerators. An accelerator designed specifically for one problem not only fits the problem well but is also relatively easy to design. Such accelerators usually speed up the inference process of deep learning algorithms rather than the training process.

The paper [13] used an FPGA to design a dedicated accelerator for the LSTM algorithm, implementing an efficient speech recognition engine (ESE).

Long short-term memory (LSTM) neural networks are widely used in speech recognition. To speed up prediction and save energy, the authors used a load-balance-aware pruning method that compresses the LSTM model size by 20x (10x from pruning, 2x from quantization) with negligible loss of prediction accuracy. The compressed model is then encoded and partitioned across multiple PEs for parallelism, and the complex LSTM data flow is scheduled by a specially designed scheduler. Finally, an ESE hardware architecture that runs directly on the sparse LSTM model is implemented.
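The idea behind load-balance-aware pruning can be sketched as follows. The sketch is not the implementation of [13]; it assumes that rows are interleaved across PEs (row r goes to PE r mod num_pes) and that each PE's share is magnitude-pruned to the same number of non-zeros. The function names and the example matrix are hypothetical.

```python
def prune_load_balanced(weights, num_pes, keep_ratio):
    """Magnitude-prune each PE's share of rows to the same non-zero count.

    Assumes row r is assigned to PE r % num_pes (an illustrative mapping).
    """
    pruned = [row[:] for row in weights]
    for pe in range(num_pes):
        rows = range(pe, len(weights), num_pes)
        entries = sorted(((abs(weights[r][c]), r, c)
                          for r in rows for c in range(len(weights[r]))),
                         reverse=True)
        keep = int(len(entries) * keep_ratio)
        for _, r, c in entries[keep:]:      # zero the smallest weights of this PE
            pruned[r][c] = 0.0
    return pruned

def nonzeros_per_pe(weights, num_pes):
    """Number of non-zero weights each PE has to process."""
    return [sum(1 for r in range(pe, len(weights), num_pes)
                for v in weights[r] if v != 0)
            for pe in range(num_pes)]

# Hypothetical matrix whose large weights all fall on the rows of PE 0.
w = [[9, 8, 7, 6], [0.1, 0.2, 0.3, 0.4], [5, 4, 3, 2], [0.5, 0.6, 0.7, 0.8]]
balanced = prune_load_balanced(w, num_pes=2, keep_ratio=0.5)
```

Pruning the same matrix by global magnitude would leave every surviving weight on PE 0, so PE 1 would sit idle; pruning per PE keeps the workloads equal, which is the property the scheduler relies on.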

The ESE is implemented on a Xilinx XCKU060 FPGA operating at 200 MHz. It operates directly on a sparse LSTM network with a performance of 282 GOPS, corresponding to 2.52 TOPS on a dense LSTM network, and it processes a full LSTM for speech recognition with a power dissipation of 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43x and 3x faster than the Core i7 5930k CPU and Pascal Titan X GPU, respectively, and its energy efficiency is 40x and 11.5x higher, respectively.

#### B. Designing Accelerators for Specific Algorithms

Designing FPGA accelerators for a specific neural network algorithm is currently a hot topic in accelerator research. The main reason is that an accelerator designed for a specific neural network algorithm usually needs only specific parameter settings or a few small changes to fit a particular problem well.

Current mainstream research in this area focuses on two deep learning algorithms, the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN), which are discussed below:

1) Convolutional neural network: The convolutional neural network (CNN) is widely used because it mimics the behavior of the biological optic nerve to achieve high accuracy in image recognition. Recently, applications based on deep neural network algorithms have emerged in rapid succession. For fast training and inference, much research has turned to the FPGA platform. Because FPGA platforms offer high performance, reconfigurability and rapid development, various FPGA-based accelerators have been proposed for deep convolutional neural networks. Although typical FPGA accelerators have been shown to outperform general-purpose processors, the exploration of the accelerator design space is far from complete.

The paper [27] argues that such accelerators still have a significant problem: the computational throughput may not match the memory bandwidth of the FPGA platform well. As a result, existing methods fail to achieve optimal performance because they underuse either the logic resources or the memory bandwidth. At the same time, the increasing complexity and scale of deep learning applications exacerbate this problem. To overcome it, the authors use an analytical design scheme based on the roofline model. For any CNN design, they quantitatively analyze its computing throughput and required memory bandwidth under different optimization methods, such as loop unrolling and transformation. Then, using the roofline model, they identify the solution with the best performance and the lowest FPGA resource requirements. Following [2], the double-buffer mechanism is used to optimize memory access, which further improves the CTC ratio. Finally, a convolutional neural network accelerator is implemented on the VC707 FPGA platform.

In the experiment, the peak performance of the hardware reached 61.62 GFLOPS at an operating frequency of 100 MHz. The double-buffer mechanism and the computing unit design of the paper can be applied to other algorithms, a step toward a universal accelerator for neural networks.

2) Recurrent neural network: Long short-term memory recurrent neural networks (LSTM-RNNs) are a kind of recurrent neural network (RNN) with a wide range of applications in speech recognition, machine translation, and scene analysis. The paper [15] proposed an FPGA-based LSTM-RNN accelerator that optimizes both computing performance and communication requirements. The accelerator achieves a peak performance of 7.26 GFLOP/s.

The accelerator structure of the paper [15] is the same as that of the paper [2], except that an LSTM buffer is added in the middle to hold the state parameters of the LSTM hidden layer directly. This eliminates the need to reload the parameters each time and reduces the bandwidth required of the hardware. The idea can be generalized by adding a direct cache between adjacent layers so that, when computing a neural network model, the next layer's computation does not require loading data from external memory.

#### C. Design Accelerator for Common Features of Algorithms

The two previous kinds of accelerators are designed for dedicated purposes, but some practices from their design can be ported to accelerators for other algorithms. Accelerators designed around the common features of algorithms can therefore be implemented more effectively. According to the known papers, the general methods are computational optimization and memory access optimization.

1) Calculation Optimization: Most deep learning algorithms involve many large-scale matrix operations during training or inference. These matrix operations generally require substantial computing resources and are often the core of the algorithm. Accelerating them can therefore effectively improve the overall performance of the algorithm.

The paper [2] optimizes the loop code of the matrix operation. First, loop unrolling is applied to transform the original loop code. Then, according to the relationship between each loop variable and the arrays accessed in the loop, the loop dimensions are divided into the following three categories:

- 1) Irrelevant. If a loop variable  $i_k$  does not appear in any access function of array A, then the corresponding loop dimension is said to be irrelevant to array A.
- 2) Independent. If the data space accessed on array A is completely separable along a certain loop dimension  $i_k$ , that is, if for any two distinct parameters  $p_1$  and  $p_2$  the data accessed by  $DS(A, i_k = p_1) = \bigcup Image(F_S^A, (D_S \cap i_k = p_1))$  is disjoint from  $DS(A, i_k = p_2) = \bigcup Image(F_S^A, (D_S \cap i_k = p_2))$ , then the loop dimension  $i_k$  is independent of array A.
- 3) Dependent. If the data space accessed on array A is not separable along a certain loop dimension  $i_k$ , then the loop dimension  $i_k$  is dependent on array A.

The paper used this classification to select parameters and design the hardware, achieving hardware acceleration. Almost every study addresses optimization of the computation, and most approaches can be roughly classified under the above method. Another approach exploits the sparsity of the weight matrix and skips the computation whenever a weight value is detected to be 0 [6].
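The three categories above can be approximated by a purely syntactic check, sketched below. It is a conservative simplification of the set-based definition: a loop variable that shares an index expression with another variable is always reported as dependent, even where a large stride would make the accessed data disjoint. The access functions, loop variable names and data layout are assumptions modeled on a convolution loop nest.

```python
def classify(loop_var, access):
    """Classify a loop variable against one array.

    access: list of index expressions, each given as the set of loop
    variables that appear in that index (a hypothetical encoding).
    """
    used = [index for index in access if loop_var in index]
    if not used:
        return "irrelevant"      # the variable never appears in the access function
    if all(len(index) == 1 for index in used):
        return "independent"     # the variable alone selects a disjoint slice
    return "dependent"           # the variable is mixed with others, e.g. S*row + i

# Assumed access functions of a convolution loop nest:
#   output_fm[to][row][col], weights[to][ti][i][j], input_fm[ti][S*row+i][S*col+j]
ACCESS = {
    "output_fm": [{"to"}, {"row"}, {"col"}],
    "weights": [{"to"}, {"ti"}, {"i"}, {"j"}],
    "input_fm": [{"ti"}, {"row", "i"}, {"col", "j"}],
}
relations = {(var, array): classify(var, access)
             for array, access in ACCESS.items()
             for var in ("to", "ti", "row", "col", "i", "j")}
```

In this encoding the input-channel loop `ti` is irrelevant to `output_fm`, which is the property that later allows output transfers to be moved out of the innermost loop.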

2) Memory access optimization: Design variables with a higher computational roof do not necessarily achieve higher performance under the constraints of memory bandwidth. The paper [6] has also optimized memory access to reduce traffic through efficient data reuse. Code 1 illustrates the memory transfer operations of a CNN layer. The input feature maps, the weights and the output feature maps are loaded before the calculation engine starts, and the output feature maps are then written back to main memory.

CODE 1: Computing Flow of Convolution Layer

**INPUT:** The feature map of last layer, input\_fm, (S\*R+K, S\*C+K, N); The convolution kernel, weights, (K, K, M); The size of stride, S

**OUTPUT:** The feature map of this layer, output\_fm, (R, C, M)

```
function Cal_CONV
    for row = 0 to R by T_r do
        for col = 0 to C by T_c do
            for tm = 0 to M by T_m do
                for tn = 0 to N by T_n do
                    // load output feature maps
                    // load weights
                    // load input feature maps
                    L: foo(output_fm(tm, row, col), weights(tm, tn), input_fm(tn, row, col));
                    // store output feature maps
                end for
                // store output feature maps
            end for
        end for
    end for
end function
```

If the innermost loop of the communication part (the loop over $tn$ in Code 1) is irrelevant to an array, there will be redundant memory operations between different loop iterations. Here the innermost loop variable $tn$ is irrelevant to the array output\_fm, so the accesses to output\_fm can be moved to the outer loop. With this memory optimization, the total number of memory access operations on output\_fm is reduced from  $\frac{2*M*N*R*C}{T_m*T_n*T_r*T_c}$  to  $\frac{M*R*C}{T_m*T_r*T_c}$.
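The reduction in output\_fm traffic can be checked by walking the tile loops of Code 1 and counting transfers. The sketch assumes that every dimension is an exact multiple of its tile size; the function name and the layer and tile sizes are hypothetical.

```python
def count_output_transfers(M, N, R, C, Tm, Tn, Tr, Tc, hoisted):
    """Count output_fm tile loads and stores issued by the tiled loop nest."""
    transfers = 0
    for row in range(0, R, Tr):
        for col in range(0, C, Tc):
            for tm in range(0, M, Tm):
                for tn in range(0, N, Tn):
                    if not hoisted:
                        transfers += 2   # load partial sums, then store them back
                if hoisted:
                    transfers += 1       # accumulate on chip, store once per output tile
    return transfers

# Hypothetical layer and tile sizes.
sizes = dict(M=8, N=6, R=4, C=4, Tm=2, Tn=3, Tr=2, Tc=2)
before = count_output_transfers(hoisted=False, **sizes)
after = count_output_transfers(hoisted=True, **sizes)
```

For these sizes the count drops from 64 to 16 tile transfers, a reduction by a factor of 2 x N / T_n, because the partial sums stay on chip across the whole input-channel loop.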

Memory access optimization also appears throughout the literature, and room for further optimization remains. For example, the output load and output store in the above code can be merged into one: while the output is being computed, the output of this layer is written in order into the input buffer of the next layer, which further saves time.

#### D. Designing a Universal Accelerator Framework with Hardware Templates

Designing accelerators from hardware templates is a more general approach than the previous design methods. These hardware templates are often FPGA implementations of a particular programming model. The user only needs to design a small part of the modules and configure the parameters. Once the parameters and modules are determined, the accelerator framework can run automatically to speed up the problem the user wants to solve.

With the development of C-to-RTL tools, users can write specific modules in C instead of Verilog or VHDL, which greatly simplifies the design work and promotes the adoption of hardware template frameworks. At present, only a few papers have been found that study this direction.

The paper [16] proposed FP-DNN (Field Programmable DNN), an end-to-end framework that takes a DNN described in TensorFlow as input and automatically generates a hardware implementation on an FPGA board using RTL-HLS hybrid templates. FP-DNN uses a high-performance computing engine and a carefully designed communication optimization strategy to implement the DNN.

Table 1 compares the paper's implementation with implementations on the CPU and GPU. Multiple neural network models were selected as benchmarks: VGG [20], LSTM [21], and Res-Net [22]. Performance was evaluated in GOP/s and energy efficiency in GOP/J.

They applied both data precisions to these three models and compared the model accuracy of 32-bit floating point and 16-bit fixed point in Table 1. The top-5 accuracy of VGG-19 and Res-152 was evaluated on the ImageNet dataset. The LSTM-LM model was evaluated by its perplexity on the PTB dataset, where lower perplexity means better performance in the language modeling task.

It can be seen that, while maintaining comparable accuracy, the 16-bit fixed-point performance of the FPGA platform is about 2X-3X that of the CPU platform, and its energy efficiency is roughly 20X higher, which supports the soundness of the framework. Compared with the GPU platform, however, the computing performance is not significantly better. Table 1 also shows that, under the same conditions, FP-DNN falls short of the directly designed FPGA accelerators in other papers in both final performance and power consumption. Further exploration and improvement are therefore needed in this direction.
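The ratios quoted above can be recomputed from the rows of Table 1 that report [16]. The sketch below copies the GOP/s and GOP/J figures of the CPU rows and of the 16-bit FP-DNN rows; the dictionary layout and the function name are illustrative.

```python
# (GOP/s, GOP/J) taken from the [16] rows of Table 1.
TABLE1 = {
    ("VGG-19", "CPU"): (119, 0.63),
    ("VGG-19", "FP-DNN 16-bit"): (364.36, 14.57),
    ("LSTM-LM", "CPU"): (103, 0.54),
    ("LSTM-LM", "FP-DNN 16-bit"): (315.85, 12.63),
}

def gain_over_cpu(model):
    """Return (throughput ratio, energy-efficiency ratio) of FP-DNN over the CPU."""
    cpu_gops, cpu_gopj = TABLE1[(model, "CPU")]
    fpga_gops, fpga_gopj = TABLE1[(model, "FP-DNN 16-bit")]
    return fpga_gops / cpu_gops, fpga_gopj / cpu_gopj

vgg_speedup, vgg_efficiency = gain_over_cpu("VGG-19")
lstm_speedup, lstm_efficiency = gain_over_cpu("LSTM-LM")
```

For both models the throughput ratio comes out at about 3x and the energy-efficiency ratio at about 23x, in line with the rounded figures given in the text.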

Research on general hardware accelerator frameworks remains scarce, and the methods proposed in this area need further validation.

#### E. Comprehensive comparison of current accelerator performance

In Table 1, we can observe the performance and power consumption of current mainstream FPGA-based neural network accelerators under different network models, hardware types and external parameters. First, according to the type of network model used, this paper divides them into three categories: VGG, LSTM, and Res-Net. The precision of the model parameters is also given. At the same time, the table also lists the platform,

TABLE I: Performance comparison of different networks on different platforms

| Model        | Platform | Type                                 | Frequency | Memory  | Precision  | GOP/s   | GOP/J | Power |
|--------------|----------|--------------------------------------|---------|-----------|------------|---------|-------|-------|
|              |          |                                      | VGG     |           |            |         |       |       |
| VGG-19 [16]  | CPU      | Xeon E5-2650v2                       | 2.6GHz  | -         | float32    | 119     | 0.63  | 95W   |
| VGG-19 [16]  | GPU      | GTX TITAN X                          | 1002MHz | 12G GDDR5 | float32    | 1704    | 6.82  | 250W  |
| VGG-19 [10]  | FPGA     | Stratix-V GSD8                       | 120MHz  | 32G DDR3  | fixed8     | 117.8   | 6.17  | 19.1W |
| VGG-16 [25]  | FPGA     | Stratix-V GSD8                       | 200MHz  | on-chip   | fixed16    | 821     | 0.17  | 19.1W |
| VGG-16 [34]  | FPGA     | Arria 10 SX660                       | 120MHz  | - DDR4    | 8-bit      | 53      | 13.9  | 3.3W  |
| VGG-16 [34]  | FPGA     | Arria 10 GX 1150                     | 150MHz  | 8G DDR3L  | fixed8/16  | 645.25  | -     | 3.3 1 |
| VGG-16 [23]  | FPGA     | Arria 10 GX 1150                     | 240MHz  | - DDR3    | fixed8/16  | 968.03  | -     | -     |
| VGG-16 [38]  | FPGA     | Stratix 10 GX 1130                   | 300MHz  | - DDR3    | fixed8/16  | 1604.57 | -     | -     |
|              |          |                                      |         |           |            |         | 20.75 | 10.18 |
| VGG [31]     | FPGA     | Arria 10 GX 1150                     | 370MHz  | 1G DDR4   | float-     | 866     | 20.75 | 19.1W |
| VGG [31]     | FPGA     | Arria 10 GX 1150                     | 385MHz  | 1G DDR4   | fixed16    | 1790    | 47.78 | -     |
| VGG-S [36]   | FPGA     | XCKU115                              | 125MHz  | off-chip  | fixed32    | 394.7   | 14.6  | 27W   |
| VGG-D [33]   | FPGA     | Virtex 7 VX690T                      | 200MHz  | off-chip  | fixed8     | 1467.6  | -     | -     |
| VGG-A [33]   |          |                                      |         |           | fixed8     | 1500    | -     |       |
| VGG-16 [37]  | 15xFPGAs | XC7VX690T                            | -       | off-chip  | fixed16    | 1197*   | 37.88 | -     |
| VGG-19 [37]  |          |                                      |         |           | fixed16    | 1220*   | 38.13 |       |
| VGG-19 [16]  | FP-DNN   | Stratix-V GSMD5                      | 150MHz  | 4G DDR3   | float32    | 81      | 3.24  | 25W   |
|              |          |                                      |         |           | float16    | 364.36  | 14.57 |       |
|              |          |                                      | LSTM    |           |            |         |       |       |
| LSTM-LM [16] | CPU      | Xeon E5-2650v2                       | 2.6GHz  | -         | float32    | 103     | 0.54  | 95W   |
| LSTM-LM [16] | GPU      | GTX TITAN X                          | 1002MHz | 12G GDDR5 | float32    | 1828    | 7.31  | 250W  |
| LSTM [13]    | FPGA     | XCKU060                              | 200MHz  | 8G DDR3   | fixed16/12 | 282.2   | 6.87  | 41W   |
| LSTM [15]    | FPGA     | Virtex7-485t                         | 150MHz  | - DDR3    | float32    | 7.26    | -     | 19.63 |
| Bi-LSTM [32] | FPGA     | Zynq XCZU7EV                         | 266MHz  | on-chip   | fixed1/8   | 1833    | -     | -     |
| LSTM-LM [16] | FP-DNN   | Stratix-V GSMD5                      | 150MHz  | 4G DDR3   | float32    | 86      | 3.44  | 25W   |
|              |          |                                      |         |           | float16    | 315.85  | 12.63 |       |
|              |          |                                      | Res-Net |           |            |         |       |       |
| Res-152 [16] | CPU      | Xeon E5-2650v2                       | 2.6GHz  | -         | float32    | 119     | 0.63  | 95W   |
| Res-152 [16] | GPU      | GTX TITAN X                          | 1002MHz | 12G GDDR5 | float32    | 1661    | 6.60  | 250W  |
| Res-152 [24] | FPGA     | Arria 10 GX 1150                     | 150MHz  | -         | float16    | 315.5   | -     | -     |
| Res-50 [24]  | FPGA     | Arria 10 GX 1150                     | 150MHz  | -         | float16    | 285.07  | -     | -     |
| Res-50 [35]  | FPGA     | Stratix-V GSD8                       | 200MHz  | on-chip   | fixed16    | 973     | -     | -     |
| Res-50 [26]  | FPGA     | Stratix 10                           | 750MHz  | -         | float32    | 15000   | 85    | -     |
| Res-50 [38]  | FPGA     | Arria 10 GX 1150                     | 240MHz  | - DDR3    | fixed8/16  | 599.61  | -     | -     |
| Res-152 [38] |          |                                      |         |           | fixed8/16  | 697.09  | -     |       |
| Res-50 [38]  | FPGA     | Stratix 10 GX 2800                   | 300MHz  | - DDR3    | fixed8/16  | 651.49  | -     |       |
| Res-152 [38] |          |                                      |         |           | fixed8/16  | 789.44  | -     | -     |
| Res-152 [16] | FP-DNN   | Stratix-V GSMD5                      | 150MHz  | 4G DDR3   | float32    | 73      | 2.92  | 25W   |

<sup>\*</sup> indicates a value measured per FPGA.

hardware types and related parameters used in these papers. The last columns of the table give the experimental results: GOP/s, GOP/j, and power, respectively.
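The three result columns are related: energy efficiency in GOP/j is throughput in GOP/s divided by power in watts. The following minimal Python sketch recomputes it for three rows of Table I, reading the power entries as watts. The function name `energy_efficiency` is chosen here for illustration only.

```python
def energy_efficiency(gops: float, power_w: float) -> float:
    """Return energy efficiency in GOP/J from throughput (GOP/s) and power (W)."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return gops / power_w


# (label, GOP/s, power in W) copied from Table I
rows = [
    ("VGG-19 GPU, GTX TITAN X", 1704.0, 250.0),
    ("VGG-19 FPGA, Stratix-V GSD8", 117.8, 19.1),
    ("VGG-19 FP-DNN float32, Stratix-V GSMD5", 81.0, 25.0),
]

for label, gops, power_w in rows:
    print(f"{label}: {energy_efficiency(gops, power_w):.2f} GOP/J")
```

For these rows the recomputed values agree with the GOP/j column of the table (6.82, 6.17, and 3.24).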

The table shows that increasing the operating frequency, changing the memory type, and appropriately reducing the parameter precision all have a positive impact on the accelerator. Among these papers, the most impressive is the multi-FPGA cluster of [37], which contains 15 FPGA chips. By effectively connecting the 15 FPGAs with workload and weight balancing, it achieves an average of about 1200 GOP/s and 38 GOP/j per FPGA chip. This per-chip throughput is comparable to or better than that of many single-FPGA accelerators in the table, while also achieving high energy efficiency. It offers a new direction for current research.

#### V. OPPORTUNITIES AND CHALLENGES

As early as the 1960s, Gerald Estrin proposed the concept of reconfigurable computing, but it was not until 1985 that Xilinx introduced the first FPGA chip. Although the FPGA platform offers excellent parallelism and power efficiency, it long received little attention because of its reconfiguration cost and high programming complexity. With the continued development of deep learning, whose workloads are highly parallel, more and more researchers are investing in FPGA-based deep learning accelerators.

#### A. Advantages of FPGA based accelerators

- 1) High performance with low energy: The energy-efficiency advantage of FPGAs is well established by many previous studies. Table I also shows that the GOP/j achieved on FPGA platforms can be tens of times that of the CPU platform, and that even the least efficient FPGA designs are roughly on par with the GPU platform. This illustrates the high performance and low energy consumption of FPGA-based neural network accelerators.
- 2) High parallelism: High parallelism is the main reason for choosing an FPGA platform to accelerate deep learning. Because the FPGA's logic units are programmable, the hardware can be tailored to a parallel algorithm to achieve a high degree of parallelism.
- 3) Flexibility: Because an FPGA is reconfigurable, it can be applied to complex engineering situations. For instance, experiments may reveal, after the hardware and application designs are complete, that performance falls short of the target; the design can then be revised and the device reconfigured. Reconfigurability enables FPGA-based hardware accelerators to handle frequent design changes well and to satisfy the changing needs of users. This flexibility is therefore an advantage of FPGA platforms over ASIC platforms.
- 4) Security: In the current era of artificial intelligence, more and more data is used for training, so data security is becoming increasingly critical. As computers are the carriers of this data, their security becomes significant as well. At present, the first response to computer security problems is to protect computers with anti-virus software. Such software can only act as a passive defender and cannot eliminate security risks. In contrast, security can be better enhanced at the hardware architecture level.

#### B. Disadvantages of FPGA based accelerators

- 1) Reconfiguration Cost: The reconfigurability of the FPGA platform is a double-edged sword. Although it brings many advantages in computational acceleration, the time cost of reconfiguration cannot be ignored: configuring an FPGA for a different design takes from tens of minutes to hours. Moreover, reconfiguration is divided into two types, static and dynamic. Static reconfiguration, also known as compile-time reconfiguration, configures the hardware for one or more system functions before the task runs and keeps that configuration locked until the task finishes. Dynamic reconfiguration, also known as runtime reconfiguration, uses a context configuration mode: hardware modules are reconfigured as needed while the task executes, but this is very susceptible to delays and increases the runtime.
- 2) Programming Difficulty: Although the concept of reconfigurable computing architecture was proposed long ago and there is a considerable body of mature work, reconfigurable computing has not become popular. There are two reasons for this:
  - The roughly 40 years from the appearance of reconfigurable computing to the early 21st century were the golden age of Moore's Law, during which process technology was updated about every year and a half. The performance improvements brought by architectural updates were therefore not as direct and forceful as those brought by technology updates;
  - Traditional CPU programming has a mature ecosystem and uses high-level, abstract programming languages. Reconfigurable computing, however, requires hardware programming, generally in hardware description languages (e.g., Verilog, VHDL) that take programmers considerable time to master.
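To make the reconfiguration cost in item 1) concrete, the sketch below estimates how many batches must run on a new design before the time spent reconfiguring is recovered. All names and numbers (`break_even_batches`, the per-batch times, the 30-minute reconfiguration time) are illustrative assumptions, not measurements from the surveyed papers.

```python
import math


def break_even_batches(reconfig_s: float, old_batch_s: float, new_batch_s: float) -> int:
    """Smallest number of batches n for which switching designs pays off.

    Switching is worthwhile once
        reconfig_s + n * new_batch_s <= n * old_batch_s.
    """
    saving_per_batch = old_batch_s - new_batch_s
    if saving_per_batch <= 0:
        raise ValueError("the new design must be faster per batch than the old one")
    return math.ceil(reconfig_s / saving_per_batch)


# Hypothetical values: 30 minutes to reconfigure, 2.0 s vs. 1.5 s per batch.
n = break_even_batches(reconfig_s=30 * 60, old_batch_s=2.0, new_batch_s=1.5)
print(f"Reconfiguration pays off after {n} batches")
```

Under these assumed numbers the break-even point is in the thousands of batches, which illustrates why frequent reconfiguration is hard to justify for short-running tasks.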

#### C. Expectation

Although FPGA-based neural network accelerators still have the problems described above, their future development is promising. Based on this survey, the following issues need further study:

- 1) Optimization in the rest of the computation process. At present, mainstream research focuses on the loop part of the matrix operations, while only a few works address the calculation of the activation function.
- 2) Access optimization. Further research is needed on other optimization methods for data access.
- 3) Data optimization. Using lower-bit data naturally improves the performance of the platform, but most designs use the same bit width for weights and neurons. Different bit widths combined with a non-linear mapping can also bring improvements, as shown in Figure 2, so a better balance remains to be explored.
- 4) Frequency optimization. At present, most of the FPGA platforms studied operate at 100 to 300 MHz, although the theoretical frequency of the FPGA platform is higher. The main reason is that the frequency is limited by the routing between the on-chip SRAM and the DSPs. Further research is needed to find a way to solve or avoid this limitation.
- 5) The integration of FPGAs. The performance reported in [37] suggests that a multi-FPGA cluster can achieve better results if the problems of scheduling and allocation are handled well. There is not much research in this direction at present, so it is worth exploring further.
- 6) Automatic configuration. A more user-friendly automatic deployment framework, similar to NVIDIA's CUDA (Compute Unified Device Architecture), would reduce the programming complexity of the FPGA platform and widen its scope of application.
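Item 3) concerns the bit width of weights and neurons. As a minimal sketch of what lower-bit fixed-point data costs in accuracy, the code below rounds a few values to signed fixed-point with a chosen total bit width and number of fractional bits, then reports the worst-case rounding error. The helper `quantize_fixed`, the example weights, and the chosen fractional bits are hypothetical; the surveyed accelerators may select formats per layer and use different rounding.

```python
def quantize_fixed(x: float, total_bits: int, frac_bits: int) -> float:
    """Round x to signed fixed-point with saturation and return the represented value."""
    scale = 1 << frac_bits                  # 2**frac_bits steps per unit
    q_min = -(1 << (total_bits - 1))        # most negative two's-complement code
    q_max = (1 << (total_bits - 1)) - 1     # most positive two's-complement code
    q = round(x * scale)                    # nearest integer code
    q = max(q_min, min(q_max, q))           # saturate instead of wrapping around
    return q / scale


weights = [0.7312, -0.1054, 0.0039, -0.9921]  # hypothetical example values

for total_bits, frac_bits in [(16, 14), (8, 6)]:
    errors = [abs(w - quantize_fixed(w, total_bits, frac_bits)) for w in weights]
    print(f"fixed{total_bits}: max abs error = {max(errors):.6f}")
```

For values inside the representable range, the rounding error is bounded by half a quantization step, so removing fractional bits trades accuracy for smaller multipliers and less memory traffic.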

#### VI. CONCLUSION

Accelerating deep learning algorithms has attracted much attention in recent years. The current mainstream platform is the GPU cluster. Although FPGAs and ASICs also offer good acceleration capability, they are popular mainly in the research field because of programming complexity and other issues. In this survey, we have investigated the design and implementation of FPGA-based accelerators in order from customized to general, compared the performance and power consumption of different designs, and summarized some directions for further research in this field. The state of the art shows that FPGAs are used to accelerate neural network computing because of their high performance, and that cutting-edge accelerator research is mostly based on this platform. The future of neural network accelerators remains very promising.

#### VII. ACKNOWLEDGEMENT

This work is partially supported by the National Key Research and Development Program of China (under Grant 2017YFA0700900), National Science Foundation of China (No. 61379040), Anhui Provincial Natural Science Foundation (No. 1608085QF12), Jiangsu Provincial Natural Science Foundation (No. BK20181193), Youth Innovation Promotion Association CAS (No. 2017497), and Fundamental Research Funds for the Central Universities (WK2150110003). The authors would like to thank all the reviewers for their valuable feedback and suggestions. Chao Wang is the corresponding author of this paper.

#### REFERENCES

- [1] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

- [2] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 161-170.
- [3] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. 2016. Going deeper with embedded FPGA platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 26-35.
- [4] Chao Wang, Lei Gong, Qi Yu, Xi Li, Yuan Xie, Xuehai Zhou: DLAU: A Scalable Deep Learning Accelerator Unit on FPGA. IEEE Trans. on CAD of Integrated Circuits and Systems 36(3):513-517(2017)
- [5] Liqiang Lu, Yun Liang, Qingcheng Xiao, and Shengen Yan. 2017. Evaluating fast algorithms for convolutional neural networks on FPGAs. In Field-Programmable Custom Computing Machines (FCCM), 2017 IEEE 25th Annual International Symposium on. IEEE, 101-108.
- [6] LIU Qinrang and LIU Chongyang. 2018. Calculation Optimization for Convolutional Neural Networks and FPGA-based Accelerator Design Using the Parameters Sparsity. Journal of Electronics & Information Technology. Vol.40. No.6. Jun. 2018.
- [7] Kaiyuan Guo, Lingzhi Sui, Jiantao Qiu, Jincheng Yu, Junbin Wang, Song Yao, Song Han, Yu Wang, and Huazhong Yang. 2017. Angel-Eye: A Complete Design Flow for Mapping CNN onto Embedded FPGA. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2017).
- [8] Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
- [9] Fengfu Li, Bo Zhang, and Bin Liu. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711 (2016)
- [10] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. 2016. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160 (2016).
- [11] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. 2016. Trained ternary quantization. arXiv preprint arXiv:1612.01064 (2016).
- [12] Chen Zhang, Di Wu, Jiayu Sun, Guangyu Sun, Guojie Luo, and Jason Cong. 2016. Energy-Efficient CNN Implementation on a Deeply Pipelined FPGA Cluster. In Proceedings of the 2016 International Symposium on Low Power Electronics and Design. ACM, 326-331.
- [13] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, et al. 2017. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA. In FPGA. 75-84.
- [14] Lei Gong, Chao Wang, Xi Li, Huaping Chen, Xuehai Zhou: MALOC: A Fully Pipelined FPGA Accelerator for Convolutional Neural Networks With All Layers Mapped on Chip. IEEE Trans. on CAD of Integrated Circuits and Systems 37(11):2601-2612(2018)
- [15] Yijin Guan, Zhihang Yuan, Guangyu Sun, and Jason Cong. 2017. FPGA-based accelerator for long short-term memory recurrent neural networks. In Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 629-634.
- [16] Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun, Wei Zhang, and Jason Cong. 2017. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. (2017), 152-159.
- [17] Hinton G E, Osindero S, Teh Y W. A fast learning algorithm for deep belief nets[J]. Neural computation, 2006, 18(7): 1527-1554.
- [18] Chao Wang, Xi Li, Yunji Chen, Youhui Zhang, Oliver Diessel, Xuehai Zhou: Service-Oriented Architecture on FPGA-Based MPSoC. IEEE Trans. Parallel Distrib. Syst. 28(10):2993-3006(2017)
- [19] Guo K, Zeng S, Yu J, et al. A Survey of FPGA Based Neural Network Accelerator[J]. arXiv preprint arXiv:1712.08934, 2017.
- [20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [21] Wojciech Zaremba, Ilya Sutskever, et al. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- [22] Kaiming He, Xiangyu Zhang, et al. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
- [23] Suda N, Chandra V, Dasika G, et al. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks[C]//Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016: 16-25.
- [24] Ma Y, Kim M, Cao Y, et al. End-to-end scalable FPGA accelerator for deep residual networks[C]//Circuits and Systems (ISCAS), 2017 IEEE International Symposium on. IEEE, 2017: 1-4.
- [25] Y. Ma, et al. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks, In ACM Int. Symp. on Field-Programmable Gate Arrays (FPGA), 2017.
- [26] Nurvitadhi E, Venkatesh G, Sim J, et al. Can fpgas beat gpus in accelerating next-generation deep neural networks?[C]//Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017: 5-14.
- [27] Qingcheng Xiao, Yun Liang, Liqiang Lu, Shengen Yan, and Yu-Wing Tai. 2017. Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs. In Proceedings of the 54th Annual Design Automation Conference 2017. ACM, 62.
- [28] S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM, 52(4):65-76, Apr. 2009.
- [29] LeCun Y, Kavukcuoglu K, Farabet C. Convolutional networks and applications in vision[C]//ISCAS. 2010, 2010: 253-256.
- [30] Chao Wang, Junneng Zhang, Xi Li, Aili Wang, Xuehai Zhou: Hardware Implementation on FPGA for Task-Level Parallel Dataflow Execution Engine. IEEE Trans. Parallel Distrib. Syst. 27(8):2303-2315(2016)
- [31] Zhang J, Li J. Improving the performance of opencl-based fpga accelerator for convolutional neural network[C]//Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2017: 25-34.
- [32] Rybalkin V, Pappalardo A, Ghaffar M M, et al. FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs[J]. arXiv preprint arXiv:1807.04093, 2018.
- [33] Yu J, Hu Y, Ning X, et al. Instruction driven cross-layer CNN accelerator with winograd transformation on FPGA[C]//Field Programmable Technology (ICFPT), 2017 International Conference on. IEEE, 2017: 227-230.
- [34] Kim J H, Grady B, Lian R, et al. FPGA-based CNN inference accelerator synthesized from multi-threaded C software[C]//System-on-Chip Conference (SOCC), 2017 30th IEEE International. IEEE, 2017: 268-273.
- [35] Zhao R, Ng H C, Luk W, et al. Towards Efficient Convolutional Neural Network for Domain-Specific Applications on FPGA[J]. arXiv preprint arXiv:1809.03318, 2018.
- [36] Huang S, Jiang J, Dou Y, et al. Design and Implementation of Convolutional Neural Network Accelerator with Variable Layer-by-layer Debugging[C]//Proceedings of the 2018 2nd International Conference on Deep Learning Technologies. ACM, 2018: 1-6.
- [37] Geng T, Wang T, Sanaullah A, et al. A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing[J].
- [38] Ma Y, Cao Y, Vrudhula S, et al. Automatic Compilation of Diverse CNNs onto High-Performance FPGA Accelerators[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018.